# +~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~ #  
#
#' @title   Validating our party-level measurements against CHES data	
#' @author  Hauke Licht
#' 
#' @note    Internet access needed to run this script (to download CHES data)
#
# +~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~ #

# setup ----

library(readr)
library(dplyr)
library(tidyr)
library(purrr)
library(lubridate)

# data paths 

base_path <- file.path(".")
data_path <- file.path(base_path, "data")
output_path <- file.path(data_path, "output")

# read data ----

# read party codes mapping
party_codes <- read_csv(file.path(data_path, "exdata", "party_codes_mapping.csv"))

# read labeled tweets (tweet-level classifications, aggregated to party-level estimates below)

all_tweets_labeled <- read_rds(file.path(output_path, "parl_party_tweets_labeled.rds"))

# read CHES data ----

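# metadata for each CHES wave: download URL, variables to keep (renamed where named),
# end of the fieldwork period, and the column of `ches_country_abbreviations`
# that holds the wave's country identifiers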
ches <- list(
  "2014" = list(
    csv_file = "https://www.chesdata.eu/s/2014_CHES_dataset_means.csv"
    , vars = c("country" = "cname", "party_id", "party" = "party_name", "lrgen", "antielite_salience")
    , field_time_end = ymd("2014-11-30")
    , country_mapping_col = 2
  )
  , "2019" = list(
    csv_file = "https://www.chesdata.eu/s/CHES2019V3.csv"
    , vars = c("country", "party_id", "party", "lrgen", "antielite_salience")
    , field_time_end = ymd("2020-01-31")
    , country_mapping_col = 3
  )
)

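# mapping of ISO3 country codes to the country identifiers used in
# CHES 2014 (abbreviations) and CHES 2019 (numeric IDs); NA = no CHES code
ches_country_abbreviations <- tibble::tribble(
  ~countr_iso3c, ~ches2014, ~ches2019,
  "AUS",   NA ,   NA ,
  "AUT", "aus",   13L,
  "BEL", "bel",    1L,
  "CAN",   NA ,   NA ,
  "DNK", "den",    2L,
  "FIN", "fin",   14L,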
  "FRA", "fra",    6L,
  "DEU", "ger",    3L,
  "GRC", "gre",    4L,
  "IRL", "ire",    7L,
  "ITA",  "it",    8L,
  "LUX", "lux",   38L,
  "NLD", "net",   10L,
  "NZL",   NA ,   NA ,
  "NOR", "nor",   35L,
  "PRT", "por",   12L,
  "ESP", "spa",    5L,
  "SWE", "swe",   16L,
  "CHE", "swi",   36L,
  "GBR",  "uk",   11L,
)

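# download one CHES wave, keep the variables of interest, and add ISO3 country codes
read_ches_data <- function(args) {
  args$data <- read_csv(args$csv_file) %>% 
    # named elements of `vars` rename columns (new name = old name)
    select(!!args$vars) %>% 
    left_join(
      ches_country_abbreviations %>% 
        # keep the ISO3 column and this wave's country identifier column
        select(1, all_of(args$country_mapping_col)) %>% 
        rename(country = 2)
      , by = "country"
    )
  args$country_mapping_col <- NULL
  # record the download time for reproducibility
  args$downloaded_at <- now()
  
  return(args)
}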

# download and prepare the data of each CHES wave
ches_data <- map(ches, read_ches_data)

ches_estimates <- ches_data %>% 
  map("data") %>% 
  map_dfr(select, -1, .id = "year") 

# match party averages over the 12 months before the end of CHES fieldwork to CHES estimates ----

ches_party_averages <- map(ches, "field_time_end") %>% 
  imap_dfr(function(edate, .year) {
    party_mean <- all_tweets_labeled %>% 
      filter(
        political == "yes"
        , as_date(created_at) <= edate
        , as_date(created_at) >= (edate %m-% months(12)) # %m-% never yields NA for non-existent dates
      ) %>% 
      group_by(country_iso3c, party_id, party_name_short, year = .year) %>% 
      summarise(
        prop_elitecriticism = mean(elitecriticism == "yes", na.rm = TRUE)
        , mean_prob_elitecriticism = mean(prob_elitecriticism, na.rm = TRUE)
        , n_tweets = n()
        , .groups = "keep"
      ) %>% 
      ungroup() 
     
    party_mean %>% 
      left_join(
        select(party_codes, country_iso3c, party_id, party_name_short, party_id_ches, to_keep)
        , by = c("country_iso3c", "party_id", "party_name_short")
      ) %>% 
      left_join(
        select(ches_data[[.year]]$data, -country)
        , by = c("country_iso3c" = "countr_iso3c", "party_id_ches" = "party_id")
      )
  })

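fp <- file.path(output_path, "validation", "party_averages_own_vs_ches_estimates.rds")
# create the output directory if needed; an existing file is not overwritten
dir.create(dirname(fp), recursive = TRUE, showWarnings = FALSE)
if (!file.exists(fp))
  write_rds(ches_party_averages, fp)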

# compute convergent validity as a function of number of quarters aggregated ----

n_quarters <- c(8, 6, 4, 2) # quarters before end of CHES fieldwork (= 24, 18, 12, and 6 months)

tmp <- map(n_quarters, function(.q) {
    imap_dfr(
      map(ches, "field_time_end")
      , function(edate, .year) {
        party_mean <- all_tweets_labeled %>% 
          filter(
            political == "yes" # keep only tweets labeled as political
            , as_date(created_at) <= edate
            , as_date(created_at) >= (edate %m-% months(.q*3)) # %m-% never yields NA for non-existent dates
          ) %>% 
          group_by(country_iso3c, party_id, party_name_short, year = .year) %>% 
          summarise(
            prop_elitecriticism = mean(elitecriticism == "yes", na.rm = TRUE)
            , mean_prob_elitecriticism = mean(prob_elitecriticism, na.rm = TRUE)
            , n_tweets = n()
            , .groups = "keep"
          ) %>% 
          ungroup() 
        
        party_mean %>% 
          left_join(
            select(party_codes, country_iso3c, party_id, party_name_short, party_id_ches, to_keep)
            , by = c("country_iso3c", "party_id", "party_name_short")
          ) %>% 
          left_join(
            select(ches_data[[.year]]$data, -country)
            , by = c("country_iso3c" = "countr_iso3c", "party_id_ches" = "party_id")
          )
      })
  })

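# correlate our party-level measure with CHES anti-elite salience, by CHES wave
# (only parties with at least 100 tweets in the aggregation window)
compute_correlations <- function(x, .q) {
  x %>% 
    filter(n_tweets >= 100) %>% 
    group_by(year) %>% 
    summarize(
      r = cor(mean_prob_elitecriticism, antielite_salience, use = "pairwise.complete.obs")
    ) %>% 
    mutate(quarters = as.integer(.q)) # imap passes list names as character
}
names(tmp) <- n_quarters
res <- imap_dfr(tmp, compute_correlations)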
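
# Illustrative addition (a sketch, not part of the original analysis):
# count how many parties enter each correlation above, i.e., parties with at
# least 100 tweets and non-missing values on both measures. This helps to judge
# how much weight each `r` can bear. The object name `res_n` and the column
# name `n_parties` are hypothetical and not used elsewhere in this script.
res_n <- imap_dfr(tmp, function(x, .q) {
  x %>% 
    filter(
      n_tweets >= 100
      , !is.na(mean_prob_elitecriticism)
      , !is.na(antielite_salience)
    ) %>% 
    count(year, name = "n_parties") %>% 
    mutate(quarters = as.integer(.q))
})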

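fp <- file.path(output_path, "validation", "ches_party_correlations_detailed.csv")
# create the output directory if needed; an existing file is not overwritten
dir.create(dirname(fp), recursive = TRUE, showWarnings = FALSE)
if (!file.exists(fp))
  write_csv(res, fp)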
